refusal feature


Safety-Aligned Weights Are Not Enough: Refusal-Teacher-Guided Finetuning Enhances Safety and Downstream Performance under Harmful Finetuning Attacks

Ham, Seokil, Choi, Yubin, Yang, Yujin, Cho, Seungju, Kim, Younghun, Kim, Changick

arXiv.org Artificial Intelligence

Recently, major AI providers such as Google and OpenAI have introduced Finetuning-as-a-Service (FaaS), which allows users to customize Large Language Models (LLMs) using their own data. However, this service is vulnerable to safety degradation when user data includes harmful prompts, a threat known as harmful finetuning attacks. Prior works attempt to mitigate this issue by first constructing a safety-aligned model and then finetuning it on user data. However, we observe that safety-aligned weights provide a weak initialization for downstream task learning, leading to suboptimal safety-alignment and downstream task performance. To address this, we propose a Refusal-Teacher (Ref-Teacher)-guided finetuning framework. Instead of finetuning a safety-aligned model on user data, our approach directly finetunes the base model under the guidance of a safety-aligned Ref-Teacher, which filters harmful prompts from user data and distills safety-alignment knowledge into the base model. Extensive experiments demonstrate that our Ref-Teacher-guided finetuning strategy effectively minimizes harmful outputs and enhances finetuning accuracy for user-specific tasks, offering a practical solution for the secure and reliable deployment of LLMs in FaaS.

Recent advancements in Large Language Models (LLMs) (Touvron et al. (2023); Jiang et al. (2023); Team et al. (2024); Team (2024); Hurst et al. (2024); Guo et al. (2025); Research et al. (2025)) have achieved remarkable performance across a wide range of natural language processing tasks. LLMs are typically pretrained on massive and diverse corpora, resulting in strong generalization ability and broad applicability across domains. To further facilitate LLMs for individual and domain-specific purposes, major AI service providers such as Google and OpenAI offer not only access to pretrained LLMs but also Finetuning-as-a-Service (FaaS).
This service enables users to upload custom datasets and adapt LLMs to more specific tasks and domains depending on their unique requirements. However, FaaS must prevent the malicious use of LLMs through safety-alignment, even when users attempt to jailbreak the models via customization. These types of attacks, which inject harmful prompts into user data for finetuning, are called harmful finetuning attacks. Several studies (Qi et al. (2023); Lermen et al. (2023); Rosati et al. (2024); Huang et al. (2024b;c;d); Li et al. (2025); Huang et al. (2025)) have shown that finetuning on user data containing harmful content compromises the safety-alignment, despite the LLMs being safety-aligned before finetuning.
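The core idea above, filtering harmful prompts out of the finetuning set while distilling the teacher's safety-aligned behavior into the base model, can be sketched as a toy objective. Everything here is a hypothetical illustration: the function names, the `alpha` weighting, and the hard filtering are assumptions for clarity, not the paper's exact formulation.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    # KL(p || q) between two probability vectors
    return float(np.sum(p * np.log(p / q)))

def ref_teacher_loss(student_logits, teacher_logits, task_loss,
                     is_harmful, alpha=0.5):
    # Prompts the teacher flags as harmful are filtered out:
    # they contribute no loss and therefore no gradient.
    if is_harmful:
        return 0.0
    # Safe prompts mix the downstream task loss with a distillation
    # term pulling the base model toward the safety-aligned teacher.
    p_teacher = softmax(teacher_logits)
    p_student = softmax(student_logits)
    return (1 - alpha) * task_loss + alpha * kl_divergence(p_teacher, p_student)
```

On filtered prompts the loss is exactly zero; on safe prompts the student is pulled toward both the task labels and the teacher's output distribution, which is how the sketch combines downstream learning with safety distillation.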


Understanding Refusal in Language Models with Sparse Autoencoders

Yeo, Wei Jie, Prakash, Nirmalendu, Neo, Clement, Lee, Roy Ka-Wei, Cambria, Erik, Satapathy, Ranjan

arXiv.org Artificial Intelligence

Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions, such as investigating the upstream-downstream relationship between latent features and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing the generalization of linear probes to out-of-distribution adversarial samples in classification tasks. We open-source our code at https://github.com/wj210/refusal_sae.
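The causal intervention described above, ablating a single SAE latent to test whether it mediates refusal, can be sketched with toy weights. The SAE here is random and untrained, and `ablate_feature` and the dimensions are illustrative assumptions, not the authors' released code.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 8, 32

# Toy stand-ins for trained sparse-autoencoder weights.
W_enc = rng.normal(size=(d_model, d_sae))
W_dec = rng.normal(size=(d_sae, d_model))

def sae_encode(x):
    # ReLU gives the sparse, non-negative latent activations.
    return np.maximum(x @ W_enc, 0.0)

def ablate_feature(x, feature_idx):
    # Subtract the decoded contribution of a single latent feature
    # from the residual-stream vector, leaving everything else intact.
    z = sae_encode(x)
    return x - z[feature_idx] * W_dec[feature_idx]

x = rng.normal(size=d_model)
x_ablated = ablate_feature(x, feature_idx=3)
```

In a real experiment the ablated activation would be patched back into the model's forward pass, and the change in generation (refusal vs. compliance) measures the feature's causal effect.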


Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks

Yi, Xin, Li, Yue, Wang, Linlin, Wang, Xiaoling, He, Liang

arXiv.org Artificial Intelligence

Ensuring safety alignment has become a critical requirement for large language models (LLMs), particularly given their widespread deployment in real-world applications. However, LLMs remain susceptible to jailbreak attacks, which exploit system vulnerabilities to bypass safety measures and generate harmful outputs. Although numerous defense mechanisms based on adversarial training have been proposed, a persistent challenge lies in the exacerbation of over-refusal behaviors, which compromise the overall utility of the model. To address these challenges, we propose a Latent-space Adversarial Training with Post-aware Calibration (LATPC) framework. During the adversarial training phase, LATPC compares harmful and harmless instructions in the latent space and extracts safety-critical dimensions to construct refusal-feature attacks, precisely simulating the attack-agnostic jailbreak types that require adversarial mitigation. At the inference stage, an embedding-level calibration mechanism is employed to alleviate over-refusal behaviors with minimal computational overhead. Experimental results demonstrate that, compared to various defense methods across five types of jailbreak attacks, the LATPC framework achieves a superior balance between safety and utility. Moreover, our analysis underscores the effectiveness of extracting safety-critical dimensions from the latent space for constructing robust refusal-feature attacks.
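A common way to extract a safety-critical direction by comparing harmful and harmless instructions in the latent space is a difference of mean activations, which a refusal-feature attack can then subtract out. This is a minimal sketch under that assumption, not LATPC's exact procedure; the function names are invented for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    # Unit-norm difference of mean activations: the direction along which
    # harmful and harmless instructions separate in the latent space.
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)

def refusal_feature_attack(x, direction):
    # Remove the component of an activation along the refusal direction,
    # simulating a jailbreak that suppresses the refusal signal.
    return x - np.dot(x, direction) * direction
```

After the attack, the activation has zero component along the extracted direction, which is the worst case an adversarial-training defense would need to withstand.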


Robust LLM safeguarding via refusal feature adversarial training

Yu, Lei, Do, Virginie, Hambardzumyan, Karen, Cancedda, Nicola

arXiv.org Artificial Intelligence

Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards: they ablate a dimension in the residual stream embedding space called the refusal feature (Arditi et al., 2024). We further show that refusal feature ablation (RFA) approximates the worst-case perturbation that offsets model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experimental results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead than existing adversarial training methods.

Large language models (LLMs) have achieved remarkable performance across a range of tasks and applications. However, LLMs are not always aligned with human values and can produce undesirable content.
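Refusal feature ablation as described above amounts to removing one direction from every residual-stream vector, which can be written as multiplication by a projection matrix. A minimal sketch follows; the matrix formulation and function name are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def rfa_projection(refusal_dir):
    # Projection matrix (I - r r^T): multiplying any residual-stream
    # vector by it zeroes the component along the refusal feature r.
    r = refusal_dir / np.linalg.norm(refusal_dir)
    return np.eye(r.shape[0]) - np.outer(r, r)
```

During ReFAT-style training, hidden states on harmful prompts would pass through such a projection, so the model learns to refuse even when the refusal feature has been ablated, simulating input-level attacks without actually running them.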